In this notebook, I'll apply topic modelling techniques using text embeddings with BERTopic. Topic modelling uses a corpus of text to identify clusters of documents that share common themes or similar vocabulary.
Traditionally, topic modelling relied on techniques like LDA (Latent Dirichlet Allocation) or LSA (Latent Semantic Analysis). In this notebook, however, I will use embeddings generated by transformer-based models to obtain a dense representation of the meaning of each text. Here, each text is one abstract from the NSF Research Awards Abstracts dataset.
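To give an intuition for why dense embeddings help: texts about the same theme end up as nearby vectors, which a clustering algorithm can then group. The sketch below uses toy 4-dimensional vectors standing in for real transformer embeddings (the values are made up for illustration) and compares them with cosine similarity.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings of three abstracts.
emb_climate_1 = np.array([0.9, 0.1, 0.0, 0.2])  # hypothetical climate abstract
emb_climate_2 = np.array([0.8, 0.2, 0.1, 0.3])  # another climate abstract
emb_algebra = np.array([0.0, 0.9, 0.1, 0.0])    # hypothetical algebra abstract

# Similar topics -> high similarity; unrelated topics -> low similarity.
print(cosine_similarity(emb_climate_1, emb_climate_2))
print(cosine_similarity(emb_climate_1, emb_algebra))
```

BERTopic builds on exactly this idea: it embeds each document, reduces the dimensionality, and clusters the resulting vectors.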
I'll start by reading the data that was prepared by another notebook.
import pandas as pd
import os
abstracts_df = pd.read_csv(os.path.join('data', 'processed', 'abstracts.csv'))
# https://www.nsf.gov/awardsearch/showAward?AWD_ID=2053734&HistoricalAwards=false
abstracts_df.dropna(subset=['award_id', 'abstract'], inplace=True)
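As a quick sanity check of what `dropna` with `subset` does here, a minimal sketch with a hypothetical DataFrame mimicking the abstracts file (one row is missing its abstract and should be removed):

```python
import pandas as pd

# Hypothetical rows mimicking the abstracts file; the award IDs are illustrative.
toy = pd.DataFrame({
    'award_id': [2053734, 2053735, 2053736],
    'abstract': ['Study of sea ice extent...', None, 'Algebraic geometry of...'],
})

# Rows missing either the award ID or the abstract text are dropped.
toy.dropna(subset=['award_id', 'abstract'], inplace=True)
print(len(toy))
```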
I'll train the topic model using BERTopic with a modified value of `min_topic_size`. I increased it from the default to 50 to get more interesting topics, each containing more abstracts. A vectorizer removes stop words after the embeddings have been computed and the topics found, as advised in the tips and tricks section of the BERTopic documentation.
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(verbose=True, min_topic_size=50, vectorizer_model=vectorizer_model)
topics, _ = topic_model.fit_transform(abstracts_df['abstract'].tolist())
num_topics = len(topic_model.get_topic_info())
print(f'# of topics discovered: {num_topics}')
2023-10-06 15:14:16,447 - BERTopic - Transformed documents to Embeddings
2023-10-06 15:14:34,508 - BERTopic - Reduced dimensionality
2023-10-06 15:14:34,897 - BERTopic - Clustered reduced embeddings
# of topics discovered: 45
The model discovered 45 clusters: 44 topics plus the outlier group. Let's analyze the top 8 of them. According to the documentation, we can ignore Topic -1, which contains the outliers.
topic_model.get_topic_info().head(9)
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 5224 | -1_research_project_using_students |
| 1 | 0 | 748 | 0_covid19_social_pandemic_health |
| 2 | 1 | 593 | 1_species_plant_research_project |
| 3 | 2 | 592 | 2_physics_stars_universe_matter |
| 4 | 3 | 518 | 3_theory_geometry_equations_algebraic |
| 5 | 4 | 400 | 4_stem_engineering_learning_education |
| 6 | 5 | 383 | 5_ice_climate_ocean_sea |
| 7 | 6 | 347 | 6_cells_cell_proteins_protein |
| 8 | 7 | 334 | 7_mantle_seismic_earthquakes_subduction |
topic_model.visualize_barchart(top_n_topics=8, n_words=10, height=700)
From the last visualization, I can identify the 8 topics:

- 0: COVID-19 pandemic
- 1: Botany / ecology
- 2: Astrophysics
- 3: Mathematics
- 4: STEM education
- 5: Climate change
- 6: Microbiology / biochemistry
- 7: Geology (seismic and volcanic activity)
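These interpretations can be attached back to the documents. A minimal sketch, assuming a hypothetical mapping from topic ID to a human-readable label and a toy DataFrame of assigned topics:

```python
import pandas as pd

# Hypothetical human-readable labels for some of the topics found above.
topic_labels = {
    -1: "Outliers",
    0: "COVID-19 pandemic",
    1: "Botany / ecology",
    2: "Astrophysics",
}

# Toy per-document topic assignments (in the notebook these would come
# from the `topics` list returned by fit_transform).
toy = pd.DataFrame({'topic': [0, 2, -1]})
toy['topic_label'] = toy['topic'].map(topic_labels)
print(toy['topic_label'].tolist())
```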
topic_model.visualize_topics(top_n_topics=num_topics)
Another way of visualizing the relationships among topics is through the hierarchy produced by HDBSCAN, the clustering algorithm BERTopic uses to form the topics.
topic_model.visualize_hierarchy(top_n_topics=num_topics, width=1000)